This workshop is mainly adapted from the mixOmics tutorials (http://mixomics.org) and the Bioconductor vignette (https://bioconductor.org/packages/release/bioc/html/mixOmics.html).
Partial least squares discriminant analysis (PLS-DA) is a statistical method that combines the techniques of partial least squares (PLS) regression and linear discriminant analysis (LDA).
PLS-DA works by first finding the linear combinations of the predictor variables (X) that are most correlated with the response variable (Y). These linear combinations are called latent variables. The latent variables are then used to train a LDA model, which can be used to classify new samples.
PLS-DA has several advantages over other classification methods. First, it is relatively robust to collinearity, which is a problem that can occur in data sets with many correlated variables. Second, PLS-DA can handle data sets with a large number of predictor variables. Third, PLS-DA can be used to visualize data, which can be helpful for understanding the relationships between the predictor variables and the response variable.
library(mixOmics)
## Loading required package: MASS
## Loading required package: lattice
## Loading required package: ggplot2
##
## Loaded mixOmics 6.24.0
## Thank you for using mixOmics!
## Tutorials: http://mixomics.org
## Bookdown vignette: https://mixomicsteam.github.io/Bookdown
## Questions, issues: Follow the prompts at http://mixomics.org/contact-us
## Cite us: citation('mixOmics')
data(breast.TCGA)
# Extract training data and name each data frame
# There are 3 dataframes (3 omics of the same samples), stored as list
# In each dataframe, the rows represent samples.
# Columns represent feature measurments (genes, miRNA, protein)
X <- list(mRNA = breast.TCGA$data.train$mrna,
miRNA = breast.TCGA$data.train$mirna,
protein = breast.TCGA$data.train$protein)
Y <- breast.TCGA$data.train$subtype
table(Y)
## Y
## Basal Her2 LumA
## 45 30 75
We can use the plsda() function to find combinations of features that discriminate the groups.
The inputs are a matrix of values (sample-rows by feature-columns, X), together with a vector of the groups (Y).
We also need to choose the number of latent components (starting here with ncomp=3), and whether to add scaling and logratio (as seen in Session 1 with pca()).
plsda.mRNA <- plsda(X$mRNA, Y=Y, ncomp = 3, scale = TRUE, logratio = "CLR")
# plot the result
plotIndiv(plsda.mRNA, ind.names = FALSE, legend=TRUE, title="PLSDA on mRNA",
legend.title = "Cancer\nsubtype", ellipse = TRUE, ellipse.level = 0.6)
plotIndiv() applied to plsda objects projects the samples into plsda component subspace; components 1 and 2 by default. We can look at other components, which might show further separation of groups in complex datasets.
plotIndiv(plsda.mRNA, comp = c(1,3), ind.names = FALSE, legend=TRUE, title="PLSDA on mRNA",
legend.title = "Cancer\nsubtype", ellipse = TRUE, ellipse.level = 0.6)
PLSDA with a single component may be sufficient for some experiments (e.g. a binary group comparison of healthy vs disease samples), whereas experiments with many groups might be optimally separated using more components.
Each component is a composite of all features with a weight coefficient (i.e. loadings).
Loadings can be displayed using plotLoadings() and selectVar().
In PLSDA, all features are used, although the “weight” given to many of the features might be close to zero.
plotLoadings(plsda.mRNA, comp = 1, contrib = "max", method = "median")
plotLoadings(plsda.mRNA, comp = 2, contrib = "max", method = "median")
loadings.comp1 <- selectVar(plsda.mRNA, comp=1)
loadings.comp2 <- selectVar(plsda.mRNA, comp=2)
head(loadings.comp1$value)
## value.var
## ZNF552 -0.1405038
## SLC43A3 0.1373226
## PREX1 -0.1280330
## FMNL2 0.1274558
## CCNA2 0.1256307
## LMO4 0.1222428
head(loadings.comp2$value)
## value.var
## CDK18 -0.2059362
## TP53INP2 0.2032781
## TRIM45 -0.1901160
## STAT5A -0.1774087
## PLEKHA4 -0.1667928
## ZNF37B -0.1649502
There is usually a small fraction of features, which contribute majorly to the usefulness of each component, while adding too many features gives diminishing returns or negative impact (due to noise). The Sparse version of PLSDA (sPLSDA) is designed to select an optimal/minimal number of important features/components via tuning procedures.
General advice for choosing parameters in the tuning procedure:
The next two parameters determine how error/success the rate is defined, which will depend on the complexity of the data:
test.keepX <- 2^(1:7)
test.keepX
## [1] 2 4 8 16 32 64 128
set.seed(123)
tune.splsda.mRNA <- tune.splsda(X$mRNA, Y = Y, ncomp = 3, validation = "Mfold",
folds = 5, test.keepX = test.keepX,
dist = "centroids.dist",
scale = TRUE, logratio = "CLR",
nrepeat = 30, cpus = 3, progressBar = FALSE,
measure = "BER")
plot(tune.splsda.mRNA)
Inspect the performance (BER), which you want to minimise down, while increasing the number of components, and increasing the number of features used.
A few observations:
test.keepX <- round(2^seq(3,6.5,0.25))
test.keepX
set.seed(123)
tune.splsda.mRNA <- tune.splsda(X$mRNA, Y = Y, ncomp = 3, validation = "Mfold",
folds = 5, test.keepX = test.keepX,
dist = "centroids.dist",
scale = TRUE, logratio = "CLR",
nrepeat = 60, cpus = 3, progressBar = FALSE,
measure = "BER")
saveRDS(tune.splsda.mRNA, file = "tune.splsda.mRNA.rds")
tune.splsda.mRNA <- readRDS("tune.splsda.mRNA.rds")
plot(tune.splsda.mRNA)
# inspect the suggested optimal values of ncomp and keepX
tune.splsda.mRNA$choice.ncomp$ncomp # suggests 2 components
tune.splsda.mRNA$choice.keepX # comp1: keep 54 features, comp2: keep 23 features
Based on the last tuning procedure, let’s make a ‘tuned’ sPLSDA model using two components (with 54 and 23 genes).
tuned.splsda.mRNA <- splsda(X$mRNA, Y = Y, ncomp = 2, keepX = c(54, 23))
plotIndiv(tuned.splsda.mRNA, title="Tuned sPLSDA using 54 + 23 genes",
ind.names = FALSE, ellipse = TRUE, ellipse.level = 0.6,
legend.title = "Cancer\nSubtype", legend = TRUE)
# we can re-plot with a background-shading of the group boundary calls
background = background.predict(tuned.splsda.mRNA, comp.predicted=2, dist = "centroids.dist")
plotIndiv(tuned.splsda.mRNA, title="Tuned sPLSDA using 54 + 23 genes",
ind.names = FALSE, ellipse = TRUE, ellipse.level = 0.6,
legend.title = "Cancer\nSubtype", legend = TRUE,
background = background)
The separation between groups looks as good or better than the initial PLSDA model, and requires fewer features.
We can inspect the top-ranked chosen genes and their loadings.
plotLoadings(tuned.splsda.mRNA, comp = 1, contrib = "max", method = "median", size.name = 0.3)
plotLoadings(tuned.splsda.mRNA, comp = 2, contrib = "max", method = "median")
loadings.comp1 <- selectVar(tuned.splsda.mRNA, comp=1)
loadings.comp2 <- selectVar(tuned.splsda.mRNA, comp=2)
head(loadings.comp1$value)
## value.var
## ZNF552 -0.3509967
## KDM4B -0.3197369
## PREX1 -0.2712825
## LRIG1 -0.2691817
## CCNA2 0.2566730
## TTC39A -0.2460363
head(loadings.comp2$value)
## value.var
## TP53INP2 0.5245244
## CDK18 -0.5079990
## NDRG2 -0.3046127
## TRIM45 -0.2836764
## STAT5A -0.2636752
## PLEKHA4 -0.2417573
# save a list of selected imporant features to a csv file
write.csv(rbind(data.frame(loadings.comp1), data.frame(loadings.comp2)),
file = "splsda.mRNA.selectedFeatures.csv", row.names = FALSE)
Stability of the selected features can be reviewed with the perf() function, which repeats the selection process and estimates the mean squared error of prediction (MSEP), R2, and Q2 metrics.
The features with high stability/frequency are interpreted to be likely more integral to the model than the other features.
perf.splsda <- perf(tuned.splsda.mRNA,
validation = "Mfold", folds=5,
dist = "centroids.dist",
progressBar = FALSE, nrepeat = 30, cpus = 3)
# plot the stability of each feature for the first two components,
# 'h' type refers to plotting histogram
# A red vertical line is plotted at the number of features the model selects for each of these components.
par(mfrow=c(1,2))
plot(perf.splsda$features$stable$comp1, type = 'h',
ylab = 'Stability',
xlab = 'Features',
main = 'Comp 1', las =2,
ylim = c(0,1), xlim = c(0, 70))
abline(v = tuned.splsda.mRNA$keepX[1], col="red")
plot(perf.splsda$features$stable$comp2, type = 'h',
ylab = 'Stability',
xlab = 'Features',
main = 'Comp 2', las =2,
ylim = c(0,1), xlim = c(0, 70))
abline(v = tuned.splsda.mRNA$keepX[2], col="red")
cim() is a plotting function in mixOmics, which can generate a basic heat map of expression (or other values, depending on context). cim is an abbreviation of Clustered Image Map.
Note that saving the output (large image) as a file is recommended, rather than viewing in RStudio directly. Alternatively, you can open a new window for graphing with x11().
cim(tuned.splsda.mRNA, save = "png", name.save = "splsda.mRNA")
We can try making the heatmap more informative by:
legend=list(legend = levels(Y), # set of classes
col = unique(color.mixo(Y)), # set of colours
title = "Cancer Subtype", # legend title
cex = 0.7) # legend size
cim(tuned.splsda.mRNA, save = "png", name.save = "splsda.mRNA.comp1", comp = 1,
row.sideColors = color.mixo(Y), legend=legend, row.names = FALSE,
margins = c(5,2), title = "mRNA sPLSDA comp1", ylab = "Samples")
cim(tuned.splsda.mRNA, save = "png", name.save = "splsda.mRNA.comp2", comp = 2,
row.sideColors = color.mixo(Y), legend=legend, row.names = FALSE,
margins = c(5,2), title = "mRNA sPLSDA comp2", ylab = "Samples")
sPLS-DA models can be useful to find a good combination of distinctive features that are associated with the experimental groupings. These models can also be used to make predictions, classifying a new sample.
# Prepare test set data
data.test.mRNA <- breast.TCGA$data.test$mrna
Y.test <- breast.TCGA$data.test$subtype
table(Y.test)
## Y.test
## Basal Her2 LumA
## 21 14 35
predict.mRNA <- predict(tuned.splsda.mRNA, newdata = data.test.mRNA,
dist = "centroids.dist")
predict.mRNA # List the different outputs
##
## Call:
## predict.mixo_spls(object = tuned.splsda.mRNA, newdata = data.test.mRNA, dist = "centroids.dist")
##
## Main numerical outputs:
## --------------------
## Prediction values of the test samples for each component: see object$predict
## variates of the test samples: see object$variates
## Classification of the test samples for each distance and for each component: see object$class
head(predict.mRNA$predict)
## , , dim1
##
## Basal Her2 LumA
## A54N 0.8241670 0.2646864 -0.088853486
## A2NL 0.9853394 0.2845764 -0.269915847
## A6VY 0.7385149 0.2541163 0.007368833
## A3XT 0.8312773 0.2655639 -0.096841200
## A1F0 0.6333344 0.2411362 0.125529447
## A26I 0.7314737 0.2532473 0.015278979
##
## , , dim2
##
## Basal Her2 LumA
## A54N 1.0494133 -0.1409544132 0.09154107
## A2NL 1.1428287 0.0009576082 -0.14378632
## A6VY 0.7884695 0.1641541848 0.04737632
## A3XT 1.0221914 -0.0782488158 0.05605746
## A1F0 0.8016601 -0.0619977372 0.26033763
## A26I 0.6551407 0.3907136338 -0.04585434
confusion.mat.mRNA <- get.confusion_matrix(truth = Y.test,
predicted = predict.mRNA$MajorityVote$centroids.dist[,2])
confusion.mat.mRNA
## predicted.as.Basal predicted.as.Her2 predicted.as.LumA
## Basal 20 1 0
## Her2 0 14 0
## LumA 0 1 34
get.BER(confusion.mat.mRNA)
## [1] 0.02539683